from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/Presentation/BayesianDeepLearning/segmentation_uncertainty.png')
Bayesian Deep Learning became extremely popular a few years ago with the development of self-driving cars, although the idea is not entirely new: already in the late 1980s there was literature discussing how to place Priors on the weights of Neural Networks. Please watch the excellent talk by Zoubin Ghahramani if you would like to learn more about the history of Bayesian Neural Networks.
"Bayesian Deep Learning" sounds a bit silly at first glance. Why would you want to go Bayesian if you do Deep Learning? Doesn't Deep Learning automatically imply that you work with lots of data? In contrast, Bayesian statistics is popular in e.g. ecology and psychology primarily because of the lack of data: Priors are needed to compensate for it. Without Priors, which play a regularizing role, the math blows up in the high-dimensional regime where n (number of observations) << p (number of features), as in the case of linear regression:
$$
Y = \alpha + \beta X\\
\beta = \left(X^TX\right)^{-1}X^TY\\
\left(X^TX\right)^{-1} \sim \frac{1}{\det\left(X^TX\right)}\dots \rightarrow \infty, \quad n \ll p
$$
Thus, in the best case your R packages will throw errors; in the worst case you will get an answer that is completely wrong because of divergences when computing the inverse matrix.
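The blow-up is easy to demonstrate numerically. The snippet below is a standalone numpy illustration (not part of the analysis that follows): it shows that $X^TX$ is rank-deficient when n < p, and that a Gaussian Prior on the coefficients, equivalent to ridge regularization, restores a well-defined solution:

```python
import numpy as np

# Standalone illustration: n = 10 observations, p = 50 features, i.e. n << p
rng = np.random.RandomState(1)
n, p = 10, 50
X = rng.randn(n, p)
y = rng.randn(n, 1)

# X^T X is a p x p matrix but has rank at most n, hence it is singular
XtX = X.T.dot(X)
print(np.linalg.matrix_rank(XtX))   # 10 (= n), far below p = 50
# np.linalg.inv(XtX) would fail or return numerical garbage here

# A Gaussian Prior on beta is equivalent to ridge regularization:
# beta = (X^T X + lambda*I)^{-1} X^T y is well-defined for any lambda > 0
lam = 1.0
beta = np.linalg.solve(XtX + lam * np.eye(p), X.T.dot(y))
print(beta.shape)                   # (50, 1)
```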
from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/Presentation/BayesianDeepLearning/Freq_vs_Bayes.png')
In the case of Deep Learning, even if you have lots of data, Priors can still be very useful: besides their regularizing effect, which reduces overfitting, they deliver uncertainty estimates essentially for free. This is very different from traditional Maximum Likelihood Neural Networks, which deliver only point estimates and are therefore too categorical in their predictions: a Yes or No answer and nothing in between.
from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/Presentation/BayesianDeepLearning/Bayesian_vs_Frequentist_Fitting.png')
Now it is time to implement a simple Maximum Likelihood and a Bayesian Neural Network and compare their outputs. To warm up, let us start by fitting a simple non-linear 2D curve. We are going to use TensorFlow and Edward, a probabilistic programming library built on top of TensorFlow; the two have very similar syntax, so it will be easy to compare them. First, let us generate a toy data set with noise:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (15,10)
N = 100
a = tf.placeholder("float")
b = tf.placeholder("float")
c = tf.placeholder("float")
d = tf.placeholder("float")
x = tf.linspace(-5.0, 10.0, num = N, name = "x")
noise = tf.random_normal(shape = [N,], mean = 0, stddev = 50.0, seed = 1)
# y = a*x^3 + b*x^2 + c*x + d
with tf.Session() as sess:
    y = sess.run(a * tf.pow(x, 3) + b * tf.pow(x, 2) + c * x + d + noise,
                 feed_dict = {a: -1, b: 10, c: 1, d: 1})
    x = np.reshape(np.array(x.eval()), (N, 1))
    y = np.reshape(np.array(y) / np.max(y), (N, 1))
fig = plt.figure(figsize=(12, 8))
plt.scatter(x, y)
plt.xlabel("x",fontsize = 40)
plt.ylabel("y",fontsize = 40, rotation = 0)
plt.show()
This is a simple polynomial (cubic) dependence with added noise. We build the toy data set in TensorFlow on purpose in order to demonstrate the philosophy of TensorFlow, i.e. how variables are declared via placeholders and fed with real values when the Session is run. Now let us fit the curve using a one-hidden-layer Neural Network created in TensorFlow and check how the quality of the fit changes with the number of hidden neurons (nodes):
learning_rate = 0.1
training_epochs = 1000
display_step = 500
n_samples = x.shape[0] # number of samples, equal N
n_input = 1 # number of input neurons
for my_num_hidden, my_subplot_num in zip([1, 2, 4, 16], range(1, 5)):
    n_hidden = my_num_hidden # number of hidden neurons
    print('Fitting curve with {} hidden neurons Maximum Likelihood Neural Network'.format(my_num_hidden))
    X = tf.placeholder("float")
    Y = tf.placeholder("float")
    weights = {'w_hidden': tf.Variable(tf.random_normal([n_input, n_hidden])),
               'w_output': tf.Variable(tf.random_normal([n_hidden, n_input]))}
    biases = {'b_hidden': tf.Variable(tf.random_normal([n_hidden])),
              'b_output': tf.Variable(tf.random_normal([n_input]))}
    hidden_layer = tf.nn.sigmoid( tf.add(tf.matmul(X, weights['w_hidden']), biases['b_hidden']) )
    output_layer = tf.add(tf.matmul(hidden_layer, weights['w_output']), biases['b_output'])
    cost = tf.reduce_mean(tf.pow(output_layer - Y, 2))
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    my_cost = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(training_epochs):
            c, _ = sess.run([cost, optimizer], feed_dict = {X: x, Y: y})
            my_cost.append(c)
            if (epoch + 1) % display_step == 0:
                print("Epoch:", '%04d' % (epoch + 1), "cost = ", "{:.9f}".format(c))
        pred = sess.run(output_layer, feed_dict = {X: x})
    plt.subplot(int("22" + str(my_subplot_num)))
    plt.scatter(x, y)
    plt.xlabel("x", fontsize = 20)
    plt.ylabel("y", fontsize = 20, rotation = 0)
    plt.plot(x, pred, c = 'r')
    plt.title('{} hidden neurons'.format(my_num_hidden))
    print('************************************************************')
plt.show()
fig = plt.figure(figsize=(12, 8))
plt.plot(range(training_epochs), my_cost)
plt.xlabel("Epoch",fontsize = 20)
plt.ylabel("Loss",fontsize = 20, rotation = 1)
plt.show()
You can appreciate how the loss decreases while training the Neural Network, and observe how the complexity of the fitting curve changes as more hidden neurons (fitting parameters) are added. If we increase the number of hidden neurons even further, the Neural Network becomes prone to overfitting: it tries to draw the fitting line through every single data point, which gives a fantastic fit for that particular data set (error close to zero) but horrible generalizability, since the model basically starts fitting the noise.
Now let us train a Bayesian Neural Network with identical settings using Edward, a module for efficient Bayesian Inference. Edward is based on TensorFlow and its syntax looks very similar. However, the crucial difference is that we put Priors on the weights and biases and use a Likelihood function for the data y_train we want to fit. Edward uses the so-called Variational Bayes (Variational Inference) approach to fit the Neural Network, which is dramatically faster than Markov Chain Monte Carlo. Let us first demonstrate the results and then explain how Variational Bayes works:
import edward as ed
from edward.models import Normal
n_samples = x.shape[0] # number of samples, equal N
n_input = 1 # number of input neurons
for my_num_hidden, my_subplot_num in zip([1, 2, 4, 16], range(1, 5)):
    n_hidden = my_num_hidden # number of hidden neurons
    print('Fitting curve with {} hidden neurons Bayesian Neural Network'.format(my_num_hidden))
    W_0 = Normal(loc = tf.zeros([n_input, n_hidden]), scale = tf.ones([n_input, n_hidden]))
    W_1 = Normal(loc = tf.zeros([n_hidden, n_input]), scale = tf.ones([n_hidden, n_input]))
    b_0 = Normal(loc = tf.zeros(n_hidden), scale = tf.ones(n_hidden))
    b_1 = Normal(loc = tf.zeros(n_input), scale = tf.ones(n_input))
    x_train = x
    y_train = Normal(loc = tf.matmul(tf.sigmoid(tf.matmul(x_train, W_0) + b_0), W_1) + b_1, scale = 0.01)
    qW_0 = Normal(loc = tf.get_variable("qW_0/loc" + str(my_subplot_num - 1), [n_input, n_hidden]),
                  scale = tf.nn.softplus(tf.get_variable("qW_0/scale" + str(my_subplot_num - 1), [n_input, n_hidden])))
    qW_1 = Normal(loc = tf.get_variable("qW_1/loc" + str(my_subplot_num - 1), [n_hidden, n_input]),
                  scale = tf.nn.softplus(tf.get_variable("qW_1/scale" + str(my_subplot_num - 1), [n_hidden, n_input])))
    qb_0 = Normal(loc = tf.get_variable("qb_0/loc" + str(my_subplot_num - 1), [n_hidden]),
                  scale = tf.nn.softplus(tf.get_variable("qb_0/scale" + str(my_subplot_num - 1), [n_hidden])))
    qb_1 = Normal(loc = tf.get_variable("qb_1/loc" + str(my_subplot_num - 1), [n_input]),
                  scale = tf.nn.softplus(tf.get_variable("qb_1/scale" + str(my_subplot_num - 1), [n_input])))
    inference = ed.KLqp({W_0: qW_0, b_0: qb_0, W_1: qW_1, b_1: qb_1}, data = {y_train: y})
    inference.run(n_iter = 10000, n_samples = 20)
    def neural_network(x, W_0, W_1, b_0, b_1):
        h = tf.matmul(tf.sigmoid(tf.matmul(x, W_0) + b_0), W_1) + b_1
        return tf.reshape(h, [-1])
    # draw 20 sets of weights / biases from the approximate Posteriors
    mus = []
    for s in range(20):
        mus.append(neural_network(x_train, qW_0.sample(), qW_1.sample(), qb_0.sample(), qb_1.sample()))
    mus = tf.stack(mus)
    plt.subplot(int("22" + str(my_subplot_num)))
    plt.scatter(x, y)
    plt.xlabel("x", fontsize = 20)
    plt.ylabel("y", fontsize = 20, rotation = 0)
    my_mus = mus.eval()
    for i in range(20):
        plt.plot(x, my_mus[i], c = 'r')
    plt.title('{} hidden neurons'.format(my_num_hidden))
    print('************************************************************')
plt.show()
First of all, we can observe that there is not just one fitting curve but a family of possible fitting curves lying within credible intervals obtained from fitting the Bayesian Neural Network. The credible intervals result from sampling from the Posteriors of the weights and biases, which in turn produces the Posterior predictions generated by the output layer of the Bayesian Neural Network.
Now let us explain what exactly Edward used instead of a Markov Chain Monte Carlo (MCMC) sampler. Edward computes the Posterior distribution with an approach called Variational Bayes or Variational Inference (VI). The idea of VI is to analytically approximate the unknown Posterior $p(\omega|D)$ with a simpler function $q_\theta(\omega|D)$, whereas MCMC draws samples from the unknown Posterior. Then we can simply adjust the parameters $\theta$ (mean, sd, etc.) of the analytical approximation in order to make it as similar to the true Posterior as possible:
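A minimal sketch of this idea (a toy example in plain Python, not what Edward does internally): we approximate a "true Posterior" that happens to be a 1D Gaussian $p = N(\mu, \sigma)$ with a Gaussian $q_\theta = N(m, s)$ by gradient descent on the closed-form KL divergence between two Gaussians, $\mathrm{KL}(q\,\|\,p) = \log(\sigma/s) + (s^2 + (m-\mu)^2)/(2\sigma^2) - 1/2$:

```python
# Toy Variational Inference: fit q = N(m, s) to p = N(mu, sigma)
mu, sigma = 3.0, 2.0   # parameters of the "true Posterior" (known here for illustration)
m, s = 0.0, 1.0        # initial variational parameters theta = (m, s)
lr = 0.1
for _ in range(500):
    grad_m = (m - mu) / sigma**2       # dKL/dm
    grad_s = -1.0 / s + s / sigma**2   # dKL/ds
    m -= lr * grad_m
    s -= lr * grad_s
print(round(m, 3), round(s, 3))        # converges towards (mu, sigma) = (3.0, 2.0)
```

In this artificial case q can match p exactly; in a real Bayesian Neural Network the Posterior is not Gaussian, so the optimal $q_\theta$ is only an approximation.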
from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/Presentation/BayesianDeepLearning/VI.png')
Thus, the calculation of the intractable integral in the denominator of Bayes' theorem is replaced by an optimization (differentiation) problem, which is much easier to solve. Technically, we need to minimize some sort of "distance" between the true Posterior $p(\omega|D)$ and the approximation $q_\theta(\omega|D)$; this distance can be e.g. the Kullback-Leibler divergence, which is a functional, i.e. a function of the function $q_\theta(\omega|D)$. Therefore we need to vary (a fancy word for differentiate) this functional over the family of functions $q_\theta(\omega|D)$ in order to find the optimal one.
In practice, people do not minimize the KL divergence functional directly but instead maximize the Evidence Lower Bound (ELBO), since the KL divergence and the ELBO sum to a constant, namely the logarithm of the intractable integral in the denominator of Bayes' rule (the model evidence).
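Written out explicitly, with $p(D)$ denoting the model evidence (the intractable integral), this is the standard identity:
$$
\log p(D) = \underbrace{\mathbb{E}_{q_\theta}\left[\log\frac{p(D,\omega)}{q_\theta(\omega|D)}\right]}_{\rm ELBO} + \underbrace{{\rm KL}\left(q_\theta(\omega|D)\,\|\,p(\omega|D)\right)}_{\geq\,0}
$$
Since $\log p(D)$ does not depend on the variational parameters $\theta$, maximizing the ELBO is equivalent to minimizing the KL divergence, and the ELBO can be estimated from samples of $q_\theta$ without ever computing the intractable integral.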
Enough with toy examples and the dull theory of Variational Inference (VI)! Let us move to the fun part, which is applying Maximum Likelihood and Bayesian Neural Networks to scRNAseq data. Why is it particularly interesting to use Machine Learning frameworks for scRNAseq? Because of the enormous sample sizes: data sets with >400k and even 1.3M cells are freely available! Thanks to this huge power (overpowered studies), single cell data are going to dominate the analysis in Life Sciences in the near future.
from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/Presentation/BayesianDeepLearning/10X_1.3M.png')
However, gigantic sample sizes pose a few challenges which go far beyond handling Big Data and scalability. The main challenge is how to get the maximum information out of the data. The Linear Algebra formalism dominating traditional analysis and programming is no longer good enough, because we now have the power to catch interesting non-linear effects. This is exactly the benefit of Deep Learning: the deeper you go, the more interesting non-linear effects you can discover in the data. As scRNAseq analysis is predominantly based on Dimensionality Reduction and Clustering, a natural choice of Neural Network is the Deep Autoencoder:
from IPython.display import Image
Image('/home/nikolay/WABI/Misc/SingleCell/Presentation/BayesianDeepLearning/autoencoder.png')
Here we will use the 1.3M Mouse Brain 10X Genomics data set to give a flavor of how a Deep Autoencoder can be implemented in TensorFlow. A much more elaborate analysis will be presented later in the scRNAseq group discussion. TensorFlow is especially attractive for huge scRNAseq data sets because it is a highly optimized and scalable framework specifically created for working with large amounts of data.
import sys
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
# READ DATA
infile="~/WABI/Misc/SingleCell/10X_Mouse_Brain_1.3M/MouseBrain_10X_1.3M_filtered_withSeuratAnnotation_matrix_reduced.txt"
print("\n" + "You specified input file: " + infile + "\n")
expr = pd.read_csv(infile, sep = '\t', dtype = np.int16, header = None)
print("\n" + "Dimensions of input file: " + str(expr.shape) + "\n")
print("\n" + "A few first lines of input file: " + "\n")
print(expr.iloc[0:4, 0:4])
print("\n" + "Last column corresponds to cluster assignments: " + "\n")
print(expr.iloc[0:4, (expr.shape[1]-4):expr.shape[1]])
# LOG-TRANSFORM DATA
X = expr.values[:,0:(expr.shape[1]-1)]
Y = expr.values[:,expr.shape[1]-1]
print("\n" + "You have following unique cluster labels: " + "\n")
print(set(Y))
print("\n" + "Log-transforming data..." + "\n")
X = np.float16( np.log(X + 1) )
# REDUCE DIMENSIONS WITH PRINCIPAL COMPONENT ANALYSIS (PCA)
print("\n" + "Performing Principal Component Analysis (PCA) ..." + "\n")
n_input = 30
x_train_pca = PCA(n_components = n_input).fit_transform(X)
y_train_pca = Y
plt.figure(figsize=(20, 15))
plt.scatter(x_train_pca[:, 0], x_train_pca[:, 1], c = y_train_pca, cmap = 'tab20', s = 10)
plt.title('Principal Component Analysis (PCA)')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
As we can see, the PCA plot looks messy; linear dimensionality reduction is clearly not good enough for capturing distinct scRNAseq clusters of Mouse Brain cells. We need a non-linear, scalable dimensionality reduction such as a Deep Autoencoder. Potentially tSNE is the best choice for scRNAseq data visualization, however it has scalability problems which Autoencoders do not have. Alternatively, one can program tSNE in TensorFlow, which should be the fastest implementation possible to date.
import keras
import numpy as np
import pandas as pd
from keras.layers import Dense
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from keras.optimizers import Adam
from sklearn.decomposition import PCA
from keras.models import Sequential, Model
import warnings
warnings.filterwarnings("ignore")
# REDUCE DIMENSIONS WITH AUTOENCODER
model = Sequential()
model.add(Dense(100, activation='elu', kernel_initializer='he_uniform', input_shape=(n_input,)))
model.add(Dense(20, activation='elu', kernel_initializer='he_uniform'))
model.add(Dense(2, activation='linear', kernel_initializer='he_uniform', name="bottleneck"))
model.add(Dense(20, activation='elu', kernel_initializer='he_uniform'))
model.add(Dense(100, activation='elu', kernel_initializer='he_uniform'))
model.add(Dense(n_input, activation='sigmoid'))
model.compile(loss = 'mean_squared_error', optimizer = Adam(lr = 0.001))
model.summary()
# DEFINE HYPERPARAMETERS
learning_rate = 0.01
training_epochs = 100
display_step = 10
num_hidden_1 = 100 # 1st hidden layer num features
num_hidden_2 = 20 # 2nd hidden layer num features
num_bottleneck = 2 # bottleneck num features
num_input = X.shape[1] # scRNAseq data input (number of features / genes)
# TENSORFLOW GRAPH INPUT
x = tf.placeholder("float")
y = tf.placeholder("float")
weights = {
'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1])),
'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2])),
'bottleneck': tf.Variable(tf.random_normal([num_hidden_2, num_bottleneck])),
'decoder_h1': tf.Variable(tf.random_normal([num_bottleneck, num_hidden_2])),
'decoder_h2': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1])),
'decoder_out': tf.Variable(tf.random_normal([num_hidden_1, num_input])),
}
biases = {
'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1])),
'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2])),
'bottleneck': tf.Variable(tf.random_normal([num_bottleneck])),
'decoder_b1': tf.Variable(tf.random_normal([num_hidden_2])),
'decoder_b2': tf.Variable(tf.random_normal([num_hidden_1])),
'decoder_out': tf.Variable(tf.random_normal([num_input])),
}
# CONSTRUCT AUTOENCODER MODEL
print("\n" + "Constructing Autoencoder Model ..." + "\n")
def encoder(x):
    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    bottleneck = tf.add(tf.matmul(layer_2, weights['bottleneck']), biases['bottleneck'])
    return bottleneck
def decoder(x):
    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    layer_out = tf.nn.sigmoid(tf.add(tf.matmul(layer_2, weights['decoder_out']), biases['decoder_out']))
    return layer_out
encoder_op = encoder(x)
decoder_op = decoder(encoder_op)
y_pred = decoder_op
y_true = x
cost = tf.reduce_mean(tf.pow(y_true - y_pred, 2))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
# START TRAINING AUTOENCODER
print("\n" + "Start Training Autoencoder ..." + "\n")
my_cost = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        c, _ = sess.run([cost, optimizer], feed_dict = {x: X, y: Y})
        my_cost.append(c)
        if (epoch + 1) % display_step == 0:
            print("Epoch:", '%04d' % (epoch + 1), "cost = ", "{:.9f}".format(c))
    pred = sess.run(encoder_op, feed_dict = {x: X})
plt.plot(range(training_epochs), my_cost)
plt.xlabel("Epoch",fontsize = 20)
plt.ylabel("Loss",fontsize = 20, rotation = 1)
plt.show()
# VISUALIZE AUTOENCODER BOTTLENECK
plt.figure(figsize=(20, 15))
plt.scatter(pred[:,0], pred[:,1], c = Y, s = 10, cmap = 'tab20')
plt.title('Autoencoder: 5 Hidden Layers')
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()
Here we are going to show how to use TensorFlow to build a non-linear classifier for a real scRNAseq data set of Cancer-Associated Fibroblasts (CAFs) from the Kristian Pietras project, Lund University. We will start with loading modules, reading and transforming the data. Please note that the last column corresponds to the cluster assignment of the cells:
import sys
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# READ DATA
expr = pd.read_csv('/home/nikolay/WABI/K_Pietras/Manifold_Learning/bartoschek_filtered_expr_rpkm.txt', sep='\t')
print("\n" + "Dimensions of input file: " + str(expr.shape) + ", that means you have " + str(expr.shape[0]) +
" cells and " + str(expr.shape[1] - 1) + " genes remained after filtering" + "\n")
print("\n" + "A few first lines of input file: " + "\n")
print(expr.iloc[0:4, 0:4])
print("\n" + "Last column corresponds to cluster assignments: " + "\n")
print(expr.iloc[0:4, (expr.shape[1]-4):expr.shape[1]])
# LOG-TRANSFORM DATA
X = expr.values[:,0:(expr.shape[1]-1)]
Y = expr.values[:,expr.shape[1]-1]
print("\n" + "A few cluster labels: " + "\n")
print(Y[0:8])
print("\n" + "Number of unique clusters: " + str(len(set(Y))) + "\n")
print("\n" + "You have following unique cluster labels: " + "\n")
print(set(Y))
print("\n" + "Log-transforming data..." + "\n")
X = np.log(X + 1)
Next, we will perform tSNE dimensionality reduction and select only first two tSNE components to train our Neural Network classifier on, so only two dimensions will be used in the further analysis:
X_reduced = PCA(n_components = 30).fit_transform(X)
model = TSNE(learning_rate = 10, n_components = 2, random_state = 123, perplexity = 30)
tsne = model.fit_transform(X_reduced)
plt.scatter(tsne[:, 0], tsne[:, 1], c = Y, cmap = 'viridis', s = 10)
plt.title('tSNE on PCA: Cancer-Associated Fibroblasts (CAFs) sub-populations, scRNAseq data', fontsize = 15)
plt.xlabel("tSNE1", fontsize = 15)
plt.ylabel("tSNE2", fontsize = 15)
plt.show()
Now we will split the four-class data set into train and test subsets while keeping the fraction of each class constant, and convert the cluster labels into one-hot-encoded format:
X_tsne = tsne[:, 0:2]
Y_tsne = Y
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_tsne, Y_tsne, test_size = 0.4, stratify = Y_tsne, random_state = 12)
from sklearn.preprocessing import OneHotEncoder
Y_train_original = Y_train
Y_train = np.array( OneHotEncoder().fit_transform(Y_train.reshape(-1, 1)).todense() )
Y_test_original = Y_test
Y_test = np.array( OneHotEncoder().fit_transform(Y_test.reshape(-1, 1)).todense() )
Y_train
Before building the model in TensorFlow, the last thing we need to do is to define a grid for later plotting the decision boundaries of the classifier. The grid is based on X_test: along each axis we place 100 equidistant points between the minimal and maximal X_test values (remember, those are tSNE coordinates). Next, we build a two-dimensional numpy array of the 100 x 100 grid points; we will execute the classifier at each grid point and assign a cluster label (with a certain softmax probability) to it:
x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
print(xx.ravel())
print(yy.ravel())
my_grid_values = np.c_[xx.ravel(), yy.ravel()]
my_grid_values
Finally, we will build our model in TensorFlow. It is a simple one-hidden-layer model; so far we are not building deep neural networks. We will use the Adam optimizer to minimize the softmax cross entropy, since this is a multi-class classification problem. The output layer generates predictions; there are two prediction variables: 1) pred is for evaluating the model using the true labels from the test data set, 2) pred_my_grid is for assigning a class label to each of the 100 x 100 grid points, which will allow us to visualize the decision boundaries later.
import tensorflow as tf
learning_rate = 0.01
training_epochs = 10000
display_step = 1000
n_samples = X_train.shape[0]
n_classes = len(set(Y))
n_input = X_train.shape[1]
n_hidden = 16
X = tf.placeholder("float")
Y = tf.placeholder("float")
weights = {'w_hidden': tf.Variable(tf.random_normal([n_input, n_hidden])),
'w_output': tf.Variable(tf.random_normal([n_hidden, n_classes]))}
biases = {'b_hidden': tf.Variable(tf.random_normal([n_hidden])),
'b_output': tf.Variable(tf.random_normal([n_classes]))}
hidden_layer = tf.nn.sigmoid( tf.add( tf.matmul(X, weights['w_hidden']), biases['b_hidden'] ) )
output_layer = tf.add( tf.matmul( hidden_layer, weights['w_output'] ), biases['b_output'] )
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = output_layer, labels = Y))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
my_cost = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        c, _ = sess.run([cost, optimizer], feed_dict = {X: X_train, Y: Y_train})
        my_cost.append(c)
        if (epoch + 1) % display_step == 0:
            print("Epoch:", '%04d' % (epoch + 1), "cost = ", "{:.9f}".format(c))
    # EVALUATE MODEL
    pred_my_grid = sess.run( tf.nn.softmax(output_layer), feed_dict = {X: my_grid_values} )
    pred = sess.run( tf.nn.softmax(output_layer), feed_dict = {X: X_test} )
    check_prediction = sess.run( tf.equal(tf.argmax(pred, 1), tf.argmax(Y_test, 1)) )
    accuracy = sess.run( tf.reduce_mean( tf.cast(check_prediction, "float") ) )
fig = plt.figure(figsize=(12, 8))
plt.plot(range(training_epochs), my_cost)
plt.xlabel("Epoch", fontsize = 20)
plt.ylabel("Loss", fontsize = 20, rotation = 1)
plt.show()
The loss curve reaches a plateau, which is a good indication of convergence of our model. Let us now display the evaluation results for the test data set. We will build a data frame with the correct class labels, the predicted labels, as well as a simple True / False check of whether the predicted label agrees with the true one. Finally, we display the accuracy value and the confusion matrix, where we can see which samples were misclassified.
pred_output = pd.DataFrame({'True': np.array([np.argmax(i) for i in Y_test]),
'Pred': np.array([np.argmax(i) for i in pred]),
'Check': check_prediction})
print(pred_output.iloc[0:10,0:3])
print("Accuracy:", accuracy)
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
print(classification_report([np.argmax(i) for i in Y_test],
[np.argmax(i) for i in np.array([np.round(i) for i in pred])]))
fig = plt.figure(figsize=(12, 10))
sns.heatmap(pd.DataFrame(confusion_matrix([np.argmax(i) for i in Y_test],
[np.argmax(i) for i in np.array([np.round(i) for i in pred])] )),
annot=True, fmt='d', cmap='YlGnBu', alpha=0.8, vmin=0)
plt.show()
Now we are going to do something interesting: we are going to draw the decision boundaries of the classifier. We will display the probability of assigning each point of the 100 x 100 grid to a certain class (cell sub-population). In this way we can understand the probability of assigning each sample / cell from the test data set to each class. In fact, the softmax probability of each class at each grid point is already stored in the pred_my_grid variable:
print(pred_my_grid)
Thus we can construct a data frame where for each grid point we have the class it is predicted to belong to and the softmax probability of that prediction:
pd.DataFrame({'Class': np.array([np.argmax(i) for i in pred_my_grid]),
'Prob': np.array([np.max(i) for i in pred_my_grid])}).sample(frac=1).iloc[0:10, 0:2]
Now we will construct a new variable Z_prob containing the probability of the predicted class. Finally, we will display the probability heatmap on the tSNE plot:
Z_prob = np.array([np.max(i) for i in pred_my_grid])
Z_prob = Z_prob.reshape(xx.shape)
Z_prob
fig = plt.figure(figsize=(20, 8))
plt.subplot(121)
plt.scatter(X_test[:, 0], X_test[:, 1], c = Y_test_original, cmap = 'viridis', s = 20)
plt.title('tSNE on PCA', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20)
plt.ylabel("tSNE2", fontsize = 20)
plt.subplot(122)
plt.contourf(xx, yy, Z_prob, cmap = plt.cm.coolwarm, alpha = 0.8)
plt.colorbar()
plt.scatter(X_test[:, 0], X_test[:, 1], c = Y_test_original, cmap = 'viridis', s = 20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
Plotting the probability of the most likely class essentially draws the decision boundaries between the cell sub-populations. The bluish areas depict regions where the probability of assignment to the most likely class approaches 50%. Please notice that a yellow point lying inside the crowd of purple points still has a red area as its background; this implies that the probability of assigning that yellow cell to the class of purple cells is almost 100%. This is because we were using a Maximum Likelihood Neural Network, which lacks uncertainty information about the probabilities of assignment to a cluster.
Now we are going to build a Bayesian Neural Network using the PyMC3 Python module. The idea behind Bayesian Neural Networks is that weights and biases are not point estimates but have Prior probabilities. The strength of PyMC3 is that we can use both MCMC and Variational Inference (VI) approaches, in contrast to Edward, which is a purely VI-based tool; it is quite questionable whether VI can completely replace proper samplers for complex Posteriors. Another benefit of PyMC3 is that one can create any model, including a Neural Network, and simply run a sampler on it in the Bayesian fashion. We will start with loading modules; PyMC3 is Theano based, so it is especially important to load theano and theano.tensor, the latter containing the softmax activation function to be used for the output layer:
import theano
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import theano.tensor as tt
from warnings import filterwarnings
filterwarnings('ignore')
sns.set_style('white')
Next we are going to construct a model which is a Bayesian Neural Network with one hidden layer and n_hidden neurons. We will assign Normal Priors to the weights and biases and initialize them with random values. Within the model we also define the likelihood, which is a Categorical distribution since we are dealing with a scRNAseq multi-class (4 classes) classification problem.
def build_bayesian_neural_network(X, Y):
    n_samples = X.get_value().shape[0]
    n_classes = Y.get_value().shape[1]
    n_input = X.get_value().shape[1]
    n_hidden = 16
    with pm.Model() as model:
        # Priors for unknown model parameters
        w_hidden = pm.Normal( 'w_hidden', mu = 0, sd = 1, shape = (n_input, n_hidden),
                              testval = np.random.randn(n_input, n_hidden) )
        w_output = pm.Normal( 'w_output', mu = 0, sd = 1, shape = (n_hidden, n_classes),
                              testval = np.random.randn(n_hidden, n_classes) )
        b_hidden = pm.Normal( 'b_hidden', mu = 0, sd = 1, shape = (n_hidden,),
                              testval = np.random.randn(n_hidden) )
        b_output = pm.Normal( 'b_output', mu = 0, sd = 1, shape = (n_classes,),
                              testval = np.random.randn(n_classes) )
        # Expected value of outcome
        hidden_layer = pm.math.tanh( pm.math.dot(X.get_value(), w_hidden) + b_hidden )
        output_layer = pm.Deterministic('output_layer', tt.nnet.softmax(pm.math.dot(hidden_layer, w_output) + b_output))
        # Likelihood (sampling distribution) of observations
        likelihood = pm.Categorical('likelihood', p = output_layer, observed = np.where(Y.get_value())[1])
    return model
Please note that by putting Priors on the weights and biases we let the model know that those parameters have uncertainties; the MCMC sampler will therefore build Posterior distributions for them. However, we are also interested in the Posterior distribution of the prediction of the output layer (which is a combination of inputs, weights and biases), i.e. not only the softmax probabilities for each class but also the credible intervals for those probabilities. PyMC3 will not track those Posterior distributions by default because we have not explicitly assigned Priors to the output layer. In PyMC3, pm.Deterministic is a way to tell the MCMC sampler to track a variable (the output layer in our case) and build a Posterior for it as well.
Now we are going to define a function which draws samples from the Posteriors of the parameters of the Bayesian Neural Network using NUTS, one of the Hamiltonian Monte Carlo algorithms (much faster than e.g. Metropolis when derivatives with respect to the parameters can be calculated).
def train_bayesian_neural_network(model):
    with model:
        draws = 1000
        start = pm.find_MAP(maxeval = 10000)
        step = pm.NUTS()
        trace = pm.sample(draws = draws, step = step, start = start)
    return trace
Here we used a pre-calculated MAP approximation to initialize the parameters of the Bayesian Neural Network. You can find information about any PyMC3 method, including the MAP approximation, through the handy help available in PyMC3:
help(pm.find_MAP)
Now let us start training the Bayesian Neural Network by drawing samples from the Posteriors with the NUTS sampler on the X_train and Y_train data set. Here we use the following trick: we create shared theano variables and assign X_train and Y_train to them; those variables can later be reset to X_test and Y_test.
%%time
X = theano.shared(X_train)
Y = theano.shared(Y_train)
model = build_bayesian_neural_network(X, Y)
trace = train_bayesian_neural_network(model)
The MCMC NUTS sampler has built Posteriors for all model parameters, including the values on the output layer. The variable trace['output_layer'] is a tensor (3-dimensional array) whose first dimension is the 2000 samples drawn from the Posterior of the output layer (2 chains x 1000 samples per chain). The second dimension is the number of samples in the train data set. The third dimension holds the probabilities of each of the 4 classes. Thus for each cell on the train tSNE plot (429 cells) we have 4 probabilities of assignment to one of the 4 clusters, and these probabilities were drawn 2000 times from the Posterior of the output layer.
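As a minimal sketch of how such a posterior tensor can be summarized, here is a NumPy example using a randomly generated array in place of the real trace['output_layer'] (same shapes, but the values are synthetic, so the numbers themselves are not meaningful):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in for trace['output_layer']:
# 2000 posterior draws x 429 cells x 4 class probabilities
samples = rng.dirichlet(np.ones(4), size=(2000, 429))

# Posterior mean probability per cell and class (429 x 4)
mean_probs = samples.mean(axis=0)

# 95% credible interval for each probability, taken across the posterior draws
lower = np.percentile(samples, 2.5, axis=0)
upper = np.percentile(samples, 97.5, axis=0)

# Hard class assignment derived from the posterior mean
pred_labels = mean_probs.argmax(axis=1)
```

The same three operations (mean over axis 0, percentiles over axis 0, argmax over classes) apply directly to the real trace tensor.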
trace['output_layer'].shape
pm.summary(trace).iloc[0:10,0:10]
_ = pm.traceplot(trace)
plt.show()
Let us average over the 2000 samples and obtain 4 probabilities of assignment to one of the 4 clusters for each of the 429 cells; these are the predicted labels of the model, which we compare with the true Y_train one-hot encoded labels:
print(trace['output_layer'].mean(axis=0))
Y_train
The predictions given by the output layer seem to correctly recover the true class labels, and the accuracy of such predictions is of course very high because the model is evaluated on the training data set; this is not what we want, since it is not a correct model evaluation:
print('Accuracy = {}%'.format((np.where(Y_train)[1] == [np.argmax(i) for i in trace['output_layer'].mean(axis=0)]).mean()*100))
In order to get a correct model evaluation, let us criticize the model by applying the so-called Posterior Predictive Check (PPC) procedure on the hold-out test data set. PPC randomly selects 500 samples from the Posteriors of the parameters of the trained Bayesian Neural Network and, for each of those 500 parameter sets, draws new observations from the likelihood (Categorical in our case). This is equivalent to checking how well our model can generate data that resemble the true data as closely as possible.
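The idea behind PPC can be illustrated with a small NumPy sketch, using synthetic class probabilities standing in for posterior draws (the real PPC below is done by pm.sample_posterior_predictive):

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic posterior draws of class probabilities: 500 draws x 10 cells x 4 classes,
# skewed so that class 0 dominates
post_probs = rng.dirichlet([5, 1, 1, 1], size=(500, 10))

# For each posterior draw, generate a new (replicated) label for every cell
replicated = np.array([[rng.choice(4, p=p) for p in draw] for draw in post_probs])

# If the model fits well, replicated labels should resemble the observed ones;
# here the dominant class 0 should be generated most often
freq_class0 = (replicated == 0).mean()
```

Each row of `replicated` is one simulated data set; comparing their label frequencies to the observed labels is exactly the "can the model regenerate the data" check described above.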
# Replace shared variables with testing set
X.set_value(X_test)
Y.set_value(Y_test)
model = build_bayesian_neural_network(X, Y)
# Create posterior predictive samples
ppc = pm.sample_posterior_predictive(trace, model = model, samples = 500, random_seed = 0)
pred = ppc['likelihood'].mean(axis=0)
ppc['likelihood'].shape
ppc['likelihood']
Finally, the accuracy calculated on the hold-out test data set, i.e. the ultimate model evaluation accuracy, is:
print('Accuracy = {}%'.format((np.where(Y_test)[1] == np.round(pred)).mean() * 100))
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
print(classification_report([np.argmax(i) for i in Y_test], np.round(pred)))
fig = plt.figure(figsize=(10, 8))
sns.heatmap(pd.DataFrame(confusion_matrix([np.argmax(i) for i in Y_test], np.round(pred) )),
annot=True, fmt='d', cmap='YlGnBu', alpha=0.8, vmin=0)
plt.show()
This is a very good validation accuracy! If we compare the class assignment predictions on the test data set with the true Y_test labels, we conclude that they look quite similar and result in similar 4-class Categorical / Multinomial distributions:
print(list(np.round(pred)))
print(list(np.where(Y_test)[1]))
import seaborn as sns
fig = plt.figure(figsize=(20, 6))
plt.subplot(121)
sns.distplot(list(np.round(pred)))
plt.title('Prediction', fontsize = 20)
plt.subplot(122)
sns.distplot(np.where(Y_test)[1])
plt.title('Truth', fontsize = 20)
plt.show()
Now comes the most interesting part of the analysis. Let us compute and display the tSNE plot together with decision boundaries, the way we did for the Maximum Likelihood Neural Network with TensorFlow in the previous section. For this purpose we again create a grid of 101 x 101 points and run the model prediction on each point of the grid:
x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 101), np.linspace(y_min, y_max, 101))
my_grid_values = np.c_[xx.ravel(), yy.ravel()]
The grid points are our new X variable; we will also create a dummy Y variable to feed to PPC. The Y variable (labels) is a dummy because we have already trained our Bayesian Neural Network on X_train, so to evaluate the model on the grid with the Posterior Predictive Check (PPC) procedure we only need an X variable (input data) but not the labels; the labels will be predicted by the model. However, PPC in PyMC3 can only be run with a model as a context, and the model has 2 input variables, X and Y, which means we need to feed some dummy Y variable just to satisfy the model syntax. It is important that the dimensions of the dummy variable match Y_test and Y_train. Therefore, we pretend that all 10201 points on the grid belong to class label 1 (out of class labels 0, 1, 2, 3), i.e. we use [0,1,0,0] as the array of class-membership probabilities.
dummy_out = np.array([[0,1,0,0]]*10201)
dummy_out.shape
Now we are going to run the Posterior Predictive Checks (PPCs) a number of times for each point on the 101 x 101 grid in order to get mean and uncertainty estimates for each grid point. Since PPC returns a class label for each grid point, we need to recalculate the labels into probabilities of observing each of the 4 labels.
Z_prob_list = []
for k in range(10):
    X.set_value(np.float32(my_grid_values))
    Y.set_value(np.float32(dummy_out))
    model = build_bayesian_neural_network(X, Y)
    ppc = pm.sample_posterior_predictive(trace, model = model, samples = 500, progressbar = False)
    if (k + 1) % 10 == 0:
        print('Finished {} posterior predictive checks'.format(k + 1))
    my_probs = []
    for j in range(ppc['likelihood'].shape[1]):
        my_probs.append([sum(np.array([i[j] for i in ppc['likelihood']]) == 0) / ppc['likelihood'].shape[0],
                         sum(np.array([i[j] for i in ppc['likelihood']]) == 1) / ppc['likelihood'].shape[0],
                         sum(np.array([i[j] for i in ppc['likelihood']]) == 2) / ppc['likelihood'].shape[0],
                         sum(np.array([i[j] for i in ppc['likelihood']]) == 3) / ppc['likelihood'].shape[0]])
    Z_prob = np.array([np.max(i) for i in np.array(my_probs)])
    Z_prob_list.append(Z_prob)
Next we calculate the mean and the standard deviation of the probability of assigning each point of the grid to one of the 4 classes of cells, and visualize the mean probability and its uncertainty:
Z_prob_mean = np.array(Z_prob_list).mean(axis=0)
Z_prob_mean = Z_prob_mean.reshape(xx.shape)
Z_prob_std = np.array(Z_prob_list).std(axis=0)
Z_prob_std = Z_prob_std.reshape(xx.shape)
Z_prob_std
fig = plt.figure(figsize=(25, 20))
plt.subplot(221)
plt.scatter(X_test[:, 0], X_test[:, 1], c = Y_test_original, cmap = 'viridis', s = 20)
plt.title('tSNE on PCA', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20)
plt.ylabel("tSNE2", fontsize = 20)
plt.subplot(222)
plt.contourf(xx, yy, Z_prob_mean, cmap = plt.cm.coolwarm, alpha = 0.8)
plt.colorbar()
plt.scatter(X_test[:, 0], X_test[:, 1], c = Y_test_original, cmap = 'viridis', s = 20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.subplot(223)
sns.distplot(np.round(ppc['likelihood'].mean(axis=0)))
plt.title('Class Assignment Distribution', fontsize = 20)
plt.xlabel("Class Assignment", fontsize = 20)
plt.ylabel("Density", fontsize = 20)
plt.subplot(224)
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
plt.contourf(xx, yy, Z_prob_std, cmap = cmap, alpha = 0.8)
plt.colorbar()
plt.scatter(X_test[:, 0], X_test[:, 1], c = Y_test_original, cmap = 'viridis', s = 20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
We can see that the region between the purple and the yellow cluster demonstrates especially high uncertainty in assigning a cell from that region to either cluster. The same but less pronounced effect is visible at the boundary between the purple and the green cluster. Thus for each cell from the hold-out test data set we have a probability of assignment to a cluster and a measure of how uncertain this probability is.
pd.DataFrame({'Cluster': np.round(ppc['likelihood'].mean(axis=0)),
'Probability': np.array(Z_prob_list).mean(axis=0),
'Uncertainty': np.array(Z_prob_list).std(axis=0)}).sample(frac=1).iloc[0:10,0:3]
In this section we will apply a Bayesian Neural Network created in Edward to a classification problem: we will train the model to recognize subtypes of Cancer Associated Fibroblasts (CAFs) from the scRNAseq project of Kristian Pietras from Lund University. Later we will include fibroblasts from another scRNAseq data set, Colorectal Cancer (CRC), which comes from a different technology and a different organism, together with non-fibroblast cells from CRC, and demonstrate the model's higher uncertainty in assigning non-fibroblast cells to one of the CAF clusters. Let us start with reading the CAFs, CRC fibroblast and CRC non-fibroblast scRNAseq data; we specifically selected only the genes common to all 3 data sets:
import pandas as pd
cafs = pd.read_csv('/home/nikolay/WABI/K_Pietras/CRC/CAFs_XGBoost.txt',sep='\t')
crc_fibro = pd.read_csv('/home/nikolay/WABI/K_Pietras/CRC/CRC_FIBRO_RF.txt',sep='\t')
crc_all_cells_minus_fibro = pd.read_csv('/home/nikolay/WABI/K_Pietras/CRC/CRC_ALL_CELLS_MINUS_FIBRO_RF.txt',sep='\t')
print('CAFs data set has dimensions: {}'.format(cafs.shape))
print('CRC fibroblast data set has dimensions: {}'.format(crc_fibro.shape))
print('CRC other cells data set has dimensions: {}'.format(crc_all_cells_minus_fibro.shape))
cafs.iloc[0:5,0:10]
We can see that the CRC fibroblast data set contains 17 cells. Therefore we will randomly select 17 CAF cells as a hold-out test data set, and keep the remaining cells for training the classifier:
import numpy as np
np.random.seed(1)
samples_to_select = list(np.random.choice(list(cafs.index), 17, replace=False))
samples_to_keep = [i for i in list(cafs.index) if i not in samples_to_select]
print('We will randomly select {} samples'.format(len(samples_to_select)))
print('And keep the rest {} samples'.format(len(samples_to_keep)))
cafs_test = cafs.loc[samples_to_select,:]
X_cafs_test = cafs_test.values[:,0:(cafs.shape[1]-1)]
Y_cafs_test = cafs_test.values[:,cafs.shape[1]-1]
cafs_train = cafs.loc[samples_to_keep,:]
X_cafs_train = cafs_train.values[:,0:(cafs.shape[1]-1)]
Y_cafs_train = cafs_train.values[:,cafs.shape[1]-1]
Let us also randomly select 17 cells other than fibroblasts from the Colorectal Cancer (CRC) data set. We will use them to compare the classification results with the 17 fibroblasts from the same data set using the Bayesian Neural Network trained on the CAFs data set. The hypothesis is that fibroblasts from the CRC data set will have a higher uncertainty of prediction compared to the native 17 test CAFs, and the non-fibroblast cells from the CRC data set will have an even higher uncertainty of classification, i.e. the model will say "I do not know how to classify those 17 non-fibroblast cells because I was trained on fibroblasts only".
nonfibro_samples_to_select = list(np.random.choice(list(crc_all_cells_minus_fibro.index), 17, replace=False))
crc_all_cells_minus_fibro = crc_all_cells_minus_fibro.loc[nonfibro_samples_to_select,:]
crc_fibro = crc_fibro.values
crc_all_cells_minus_fibro = crc_all_cells_minus_fibro.values
nonfibro_samples_to_select
Now X_cafs_test, crc_fibro and crc_all_cells_minus_fibro should have 17 cells each. We will also one-hot encode the labels of the train and test CAFs data sets and keep the original label codes in Y_cafs_train_original and Y_cafs_test_original:
from sklearn.preprocessing import label_binarize
Y_cafs_train_original = Y_cafs_train
Y_cafs_train = label_binarize(Y_cafs_train, classes = [0, 1, 2, 3])
Y_cafs_test_original = Y_cafs_test
Y_cafs_test = label_binarize(Y_cafs_test, classes = [0, 1, 2, 3])
print('CAFs test data set has dimensions: {}'.format(X_cafs_test.shape))
print('CRC fibroblast data set has dimensions: {}'.format(crc_fibro.shape))
print('CRC other cells data set has dimensions: {}'.format(crc_all_cells_minus_fibro.shape))
Now let us build our Variational Inference model of the Bayesian Neural Network in Edward:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
from edward.models import Categorical, Normal
import edward as ed
import pandas as pd
n_samples = X_cafs_train.shape[0] # number of samples
n_input = X_cafs_train.shape[1] # number of input neurons (features)
n_classes = Y_cafs_train.shape[1] # number of classes
n_hidden = 64 # number of hidden neurons
# Create a placeholder to hold the data (in minibatches) in a TensorFlow graph.
x = tf.placeholder(tf.float32, [n_samples, n_input])
y = tf.placeholder(tf.float32, [n_samples, n_classes])
# Normal(0,1) priors for the variables. Note that the syntax assumes TensorFlow 1.1.
w_hidden = Normal(loc = tf.zeros([n_input, n_hidden]), scale = tf.ones([n_input, n_hidden]))
w_output = Normal(loc = tf.zeros([n_hidden, n_classes]), scale = tf.ones([n_hidden, n_classes]))
b_hidden = Normal(loc = tf.zeros(n_hidden), scale = tf.ones(n_hidden))
b_output = Normal(loc = tf.zeros(n_classes), scale = tf.ones(n_classes))
# Categorical likelihood for classification.
y = Categorical(tf.matmul(tf.tanh(tf.matmul(x, w_hidden) + b_hidden), w_output) + b_output)
# Construct the q(w) and q(b). In this case we assume Normal distributions.
qw_hidden = Normal(loc=tf.Variable(tf.random_normal([n_input, n_hidden])),
scale=tf.nn.softplus(tf.Variable(tf.random_normal([n_input, n_hidden]))))
qb_hidden = Normal(loc=tf.Variable(tf.random_normal([n_hidden])),
scale=tf.nn.softplus(tf.Variable(tf.random_normal([n_hidden]))))
qw_output = Normal(loc=tf.Variable(tf.random_normal([n_hidden, n_classes])),
scale=tf.nn.softplus(tf.Variable(tf.random_normal([n_hidden, n_classes]))))
qb_output = Normal(loc=tf.Variable(tf.random_normal([n_classes])),
scale=tf.nn.softplus(tf.Variable(tf.random_normal([n_classes]))))
inference = ed.KLqp({w_hidden: qw_hidden, b_hidden: qb_hidden, w_output: qw_output, b_output: qb_output},
data={x: X_cafs_train, y: Y_cafs_train_original})
inference.run(n_iter = 10000) #n_samples = 20
Now, once we have computed an approximation to the true Posterior, we can draw samples from the approximation and use them to predict the probabilities of class assignment for the 17 CAFs hold-out test cells, the 17 CRC fibroblasts and the CRC non-fibroblast cells:
# Generate samples from the posterior and store them.
n_samples = 1000
prob_cafs_test_list = []
prob_crc_fibro_list = []
prob_crc_all_cells_minus_fibro_list = []
w_hidden_samples = []
b_hidden_samples = []
w_output_samples = []
b_output_samples = []
for i in range(n_samples):
    # Sample from the posterior of the fitted parameters of the Bayesian Neural Network
    w_hidden_samp = qw_hidden.sample()
    b_hidden_samp = qb_hidden.sample()
    w_output_samp = qw_output.sample()
    b_output_samp = qb_output.sample()
    # Store samples from the posterior of the fitted parameters of the Bayesian Neural Network
    w_hidden_samples.append(w_hidden_samp)
    b_hidden_samples.append(b_hidden_samp)
    w_output_samples.append(w_output_samp)
    b_output_samples.append(b_output_samp)
    # Compute the probability of each class for each sample for the CAFs test data set, CRC fibro- and non-fibroblast cells
    prob_cafs_test_list.append(tf.nn.softmax(tf.matmul(tf.tanh(tf.matmul(np.float32(X_cafs_test), w_hidden_samp)
                               + b_hidden_samp), w_output_samp) + b_output_samp).eval())
    prob_crc_fibro_list.append(tf.nn.softmax(tf.matmul(tf.tanh(tf.matmul(np.float32(crc_fibro), w_hidden_samp)
                               + b_hidden_samp), w_output_samp) + b_output_samp).eval())
    prob_crc_all_cells_minus_fibro_list.append(
        tf.nn.softmax(tf.matmul(tf.tanh(tf.matmul(np.float32(crc_all_cells_minus_fibro), w_hidden_samp)
                      + b_hidden_samp), w_output_samp) + b_output_samp).eval())
    if (i + 1) % 100 == 0:
        print('Finished drawing {} samples'.format(i + 1))
Let us calculate the accuracy of prediction for the CAFs hold-out test data set:
accuracy_list = []
for prob in prob_cafs_test_list:
    pred = np.argmax(prob, axis = 1).astype(np.float32)
    accuracy_list.append((pred == np.where(Y_cafs_test)[1]).mean()*100)
fig = plt.figure(figsize = (12, 8))
plt.hist(accuracy_list, bins = 10)
plt.title("Histogram of prediction accuracies in the CAFs test data", fontsize = 20)
plt.xlabel("Accuracy", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
print('Accuracy = {0} +/- {1}%'.format(np.round(np.mean(np.array(accuracy_list)), 2),
np.round(np.std(np.array(accuracy_list)), 2)))
Finally, we will plot the probability and uncertainty values for the CAFs test, CRC fibroblast and CRC non-fibroblast cells (17 cells each) as box plots and check how significant the difference between the values is:
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=(20, 8))
#plt.title('Comparison', fontsize = 20)
ax = fig.add_subplot(121)
sns.boxplot(data=[[np.max(i) for i in np.array(prob_cafs_test_list).mean(axis=0)],
[np.max(i) for i in np.array(prob_crc_fibro_list).mean(axis=0)],
[np.max(i) for i in np.array(prob_crc_all_cells_minus_fibro_list).mean(axis=0)]], palette = 'tab10')
plt.ylabel('Probability',fontsize=20)
ax.set_xticklabels(['CAFs_Test','CRC_Fibro','CRC_NonFibro'], fontsize = 20)
ax = fig.add_subplot(122)
sns.boxplot(data=[[np.max(i) for i in np.array(prob_cafs_test_list).std(axis=0)],
[np.max(i) for i in np.array(prob_crc_fibro_list).std(axis=0)],
[np.max(i) for i in np.array(prob_crc_all_cells_minus_fibro_list).std(axis=0)]],
palette = 'tab10')
plt.ylabel('Uncertainty',fontsize=20)
ax.set_xticklabels(['CAFs_Test','CRC_Fibro','CRC_NonFibro'], fontsize = 20)
plt.show()
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu([np.max(i) for i in np.array(prob_crc_fibro_list).mean(axis=0)],
[np.max(i) for i in np.array(prob_crc_all_cells_minus_fibro_list).mean(axis=0)])
print('Statistics = %.3f, p = %.3f' % (stat, p))
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu([np.max(i) for i in np.array(prob_crc_fibro_list).std(axis=0)],
[np.max(i) for i in np.array(prob_crc_all_cells_minus_fibro_list).std(axis=0)])
print('Statistics = %.3f, p = %.10f' % (stat, p))
We conclude that the model has significantly higher uncertainty about the classification of CRC non-fibroblast cells compared to CRC fibroblast cells. This high uncertainty can be a strong indication that the model encountered data which it did not experience during training. In this case, more information / data is needed for the model to make a decision. This can be particularly important when the decision concerns an individual's safety, such as in health care or self-driving cars.
In this section we will implement a simple MCMC sampler in plain Python code. Then we will use the sampler to draw samples from a Posterior distribution and compare the results with the Posterior obtained by PyMC3. We will start with loading Python modules and building a toy data set. The toy data set is just a few points drawn from the Normal distribution; let us check what it looks like by plotting the corresponding histogram:
%matplotlib inline
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
sns.set_context('talk')
np.random.seed(123)
data = np.random.normal(size=20)
sns.distplot(data, rug = True)
plt.show()
Now let us define the model. We assume the Likelihood of observing the data x is Normal with two parameters $\mu$ and $\sigma$, i.e. it has the form $L(x \mid \mu,\sigma)$. In the scaled case $\sigma = 1$, the Likelihood is $L(x\mid\mu,\sigma) \sim \rm{Normal}(x \mid \mu,1)$. Next we put a Prior distribution on the mean value $\mu$; we assume that it follows a Normal distribution as well, with mean $\xi = 0$ and standard deviation $\tau = 1$. Thus our model is given by:
$$ \rm{L}(x|\mu,\sigma) \sim \rm{Normal}(x \mid \mu,\sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}} \\ \rm{Prior}(\mu) \sim \rm{Normal}(\xi,\tau) = \frac{1}{\sqrt{2\pi\tau^2}}\exp{\left\{-\frac{(\mu-\xi)^2}{2\tau^2}\right\}} $$Since we used a conjugate Prior, we can compute the integral in the Bayes formula analytically, derivation is here. The final Posterior will follow a Normal distribution:
$$ \rm{Posterior}(\mu|x) \sim \rm{Normal}(\rm{M},\Sigma) = \frac{1}{\sqrt{2\pi\Sigma^2}}\exp{\left\{-\frac{(\mu-\rm{M})^2}{2\Sigma^2}\right\}} \\ M = \frac{\displaystyle\tau^2N\bar{x}+\sigma^2\xi}{\displaystyle\tau^2N+\sigma^2} \\ \Sigma^2 = \frac{\displaystyle\sigma^2\tau^2}{\displaystyle\tau^2N+\sigma^2} $$So here we define a function which uses the result of the analytical Posterior calculation:
def analytical_posterior(mu, data, xi, tau, sigma):
    N = len(data)
    x_bar = (1/N)*(np.sum(data))
    M = (x_bar*N*tau**2 + xi*sigma**2) / (N*tau**2 + sigma**2)
    Sigma = np.sqrt(((sigma**2)*(tau**2)) / (N*tau**2 + sigma**2))
    return sp.stats.norm(M, Sigma).pdf(mu)
mu = np.linspace(-1, 1, 500)
plt.plot(mu, analytical_posterior(mu, data, 0, 1, 1))
plt.xlabel('$\mu$')
plt.ylabel('Density')
plt.title('Analytical Posterior Distribution')
plt.show()
Now let us implement our own Metropolis sampler, sample from the Posterior, and compare the results with the analytically derived Posterior:
def Metropolis_Sampler(data, mu_init = 1, n_samples = 10):
    mu_current = mu_init
    posterior_sample = [mu_current]
    for i in range(n_samples):
        # Propose a new mu from a Normal jump distribution centered at the current mu
        mu_next = sp.stats.norm(mu_current, 0.5).rvs()
        prior_current = sp.stats.norm(0, 1).pdf(mu_current)
        prior_next = sp.stats.norm(0, 1).pdf(mu_next)
        likelihood_current = sp.stats.norm(mu_current, 1).pdf(data).prod()
        likelihood_next = sp.stats.norm(mu_next, 1).pdf(data).prod()
        posterior_current = likelihood_current*prior_current
        posterior_next = likelihood_next*prior_next
        # Accept the proposal with probability min(1, posterior_next / posterior_current)
        posterior_accept = posterior_next / posterior_current
        if posterior_accept > np.random.rand():
            mu_current = mu_next
        posterior_sample.append(mu_current)
    return posterior_sample
posterior_samples = Metropolis_Sampler(data, 1, 15000)
plt.plot(posterior_samples)
plt.xlabel('Sample')
plt.ylabel('$\mu$')
plt.title('Trace Plot')
plt.show()
sns.distplot(posterior_samples, rug = True)
plt.show()
ax = plt.subplot()
sns.distplot(posterior_samples, ax=ax, label='Estimated Posterior')
x = np.linspace(-1, 1, 500)
ax.plot(x, analytical_posterior(x, data, 0, 1, 1), 'g', label='Analytic Posterior')
_ = ax.set(xlabel='$\mu$', ylabel='Belief');
ax.legend();
plt.show()
Now let us use PyMC3 and compare sampling done by PyMC3 with the hand-written Metropolis sampler and the analytical Posterior distributions:
import pymc3 as pm
with pm.Model():
    mu = pm.Normal('mu', 0, 1)
    sigma = 1
    posterior = pm.Normal('posterior', mu=mu, sd=sigma, observed=data)
    trace = pm.sample(15000, pm.Metropolis())
ax = plt.subplot()
sns.distplot(trace['mu'], label='PyMC3 sampler');
sns.distplot(posterior_samples, label='Hand-written sampler');
ax.plot(x, analytical_posterior(x, data, 0, 1, 1), 'g', label='Analytic Posterior')
ax.legend();
We conclude that the analytical Posterior, the hand-written Metropolis sampler and the PyMC3 sampler produce practically overlapping Posterior distributions.
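The agreement can also be verified numerically by comparing the analytical posterior mean M with the empirical mean of a Metropolis run. The sketch below is self-contained: it re-creates the toy data and a compact version of the same Metropolis scheme used above:

```python
import numpy as np
import scipy.stats as st

np.random.seed(123)
data = np.random.normal(size=20)

# Analytical posterior parameters for the Normal-Normal conjugate model
# (prior xi = 0, tau = 1; likelihood sigma = 1)
N, x_bar = len(data), data.mean()
M = (N * x_bar * 1**2 + 0 * 1**2) / (N * 1**2 + 1**2)
Sigma = np.sqrt((1**2 * 1**2) / (N * 1**2 + 1**2))

# Minimal Metropolis sampler, same scheme as in the text
mu_current, samples = 1.0, []
for _ in range(3000):
    mu_next = st.norm(mu_current, 0.5).rvs()
    post_cur = st.norm(mu_current, 1).pdf(data).prod() * st.norm(0, 1).pdf(mu_current)
    post_next = st.norm(mu_next, 1).pdf(data).prod() * st.norm(0, 1).pdf(mu_next)
    if post_next / post_cur > np.random.rand():
        mu_current = mu_next
    samples.append(mu_current)

empirical_mean = np.mean(samples[500:])  # drop burn-in
```

The sampler's empirical mean should land close to the analytical M, with deviations on the order of the Monte Carlo error.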
A common applied statistics task involves building regression models to characterize non-linear relationships between variables. It is possible to fit such models by assuming a particular non-linear functional form, such as a sinusoidal, exponential, or polynomial function, to describe one variable’s response to the variation in another. Unless this relationship is obvious from the outset, however, it involves possibly extensive model selection procedures to ensure the most appropriate model is retained. Alternatively, a non-parametric approach can be adopted by defining a set of knots across the variable space and use a spline or kernel regression to describe arbitrary non-linear relationships. However, knot layout procedures are somewhat ad hoc and can also involve variable selection. A third alternative is to adopt a Bayesian non-parametric strategy, and directly model the unknown underlying function. For this, we can employ Gaussian process models.
Describing a Bayesian procedure as “non-parametric” is something of a misnomer. The first step in setting up a Bayesian model is specifying a full probability model for the problem at hand, assigning probability densities to each model variable, and it is difficult to specify a full probability model without using probability functions, which are parametric. In fact, Bayesian non-parametric methods do not imply that there are no parameters, but rather that the number of parameters grows with the size of the dataset; in this sense, Bayesian non-parametric models are infinitely parametric.
What if we chose to use Gaussian distributions to model our data?
$$ p(x \mid \mu, \Sigma) = (2\pi)^{-k/2}|\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x-\mu)^{\prime}\Sigma^{-1}(x-\mu) \right\} $$There would not seem to be any gain in doing this, because normal distributions are not particularly flexible distributions in and of themselves. However, adopting a set of Gaussians (a multivariate normal vector) confers a number of advantages. First, the marginal distribution of any subset of elements from a multivariate normal distribution is also normal: $$ p(x,y) = \mathcal{N}\left(\left[{ \begin{array}{c} {\mu_x} \\ {\mu_y} \end{array} }\right], \left[{ \begin{array}{cc} {\Sigma_x} & {\Sigma_{xy}} \\ {\Sigma_{xy}^T} & {\Sigma_y} \end{array} }\right]\right) $$
$$ p(x) = \int p(x,y) dy = \mathcal{N}(\mu_x, \Sigma_x) $$Also, conditional distributions of a subset of the elements of a multivariate normal distribution (conditional on the remaining elements) are normal too:
$$ p(x|y) = \mathcal{N}\left(\mu_x + \Sigma_{xy}\Sigma_y^{-1}(y-\mu_y),\ \Sigma_x-\Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T\right) $$A Gaussian process generalizes the multivariate normal to infinite dimension. It is defined as an infinite collection of random variables, with any marginal subset having a Gaussian distribution. Thus, the marginalization property is explicit in its definition. Another way of thinking about an infinite vector is as a function. When we write a function that takes continuous values as inputs, we are essentially implying an infinite vector that only returns values (indexed by the inputs) when the function is called upon to do so. By the same token, this notion of an infinite-dimensional Gaussian represented as a function allows us to work with them computationally: we are never required to store all the elements of the Gaussian process, only to calculate them on demand.
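The conditioning formula above can be checked with a small NumPy example for a 2-dimensional Gaussian (all numbers below are made up purely for illustration):

```python
import numpy as np

# Joint 2D Gaussian: means and scalar covariance blocks (illustrative values)
mu_x, mu_y = 0.0, 1.0
Sigma_x, Sigma_y, Sigma_xy = 1.0, 2.0, 0.6

# Conditional distribution p(x | y) for an observed y, following
# mean = mu_x + Sigma_xy * Sigma_y^{-1} * (y - mu_y)
# var  = Sigma_x - Sigma_xy * Sigma_y^{-1} * Sigma_xy
y_obs = 2.0
cond_mean = mu_x + Sigma_xy / Sigma_y * (y_obs - mu_y)
cond_var = Sigma_x - Sigma_xy / Sigma_y * Sigma_xy
```

Note that the conditional variance never exceeds the marginal variance $\Sigma_x$: observing $y$ can only reduce (or leave unchanged) our uncertainty about $x$.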
So, we can describe a Gaussian process as a distribution over functions. Just as a multivariate normal distribution is completely specified by a mean vector and covariance matrix, a GP is fully specified by a mean function and a covariance function:
$$ p(x) \sim \mathcal{GP}(m(x), k(x,x^{\prime})) $$It is the marginalization property that makes working with a Gaussian process feasible: we can marginalize over the infinitely-many variables that we are not interested in, or have not observed.
For example, one specification of a GP might be:
$$ m(x) =0 $$$$k(x,x^{\prime}) = \theta_1\exp{\left(-\frac{\theta_2}{2}(x-x^{\prime})^2\right)} $$Here, the covariance function is a squared exponential, for which values of $x$ and $x^{\prime}$ that are close together result in values of $k$ closer to one, while those that are far apart return values closer to zero. It may seem odd to simply adopt the zero function to represent the mean function of the Gaussian process — surely we can do better than that! It turns out that most of the learning in the GP involves the covariance function and its hyperparameters, so very little is gained in specifying a complicated mean function.
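A sketch of this squared exponential covariance in NumPy (with $\theta_1 = \theta_2 = 1$ chosen arbitrarily), confirming that nearby inputs are highly correlated while distant ones are nearly independent, and showing how the resulting covariance matrix can be used to draw a GP prior realization:

```python
import numpy as np

def squared_exponential(x, x_prime, theta1=1.0, theta2=1.0):
    # k(x, x') = theta1 * exp(-theta2/2 * (x - x')^2)
    return theta1 * np.exp(-0.5 * theta2 * (x - x_prime) ** 2)

k_close = squared_exponential(0.0, 0.1)   # close inputs -> value near 1
k_far = squared_exponential(0.0, 5.0)     # distant inputs -> value near 0

# Full covariance matrix over a grid; a draw from N(0, K) is one realization
# of the GP prior (a small jitter keeps the matrix numerically positive definite)
xs = np.linspace(0, 3, 30)
K = squared_exponential(xs[:, None], xs[None, :])
sample = np.random.RandomState(0).multivariate_normal(np.zeros(30), K + 1e-8 * np.eye(30))
```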
For a finite number of points, the GP becomes a multivariate normal, with the mean and covariance given by the mean function and covariance function, respectively, evaluated at those points.
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel, RBF, DotProduct
scikit-learn offers a library of about a dozen covariance functions, which it calls kernels, to choose from. A flexible choice to start with is the Matérn covariance.
$$ k_{M}(x) = \frac{\sigma^2}{\Gamma(\nu)2^{\nu-1}} \left(\frac{\sqrt{2 \nu} x}{l}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2 \nu} x}{l}\right) $$where $\Gamma$ is the gamma function and $K$ is a modified Bessel function. The form of covariance matrices sampled from this function is governed by three parameters, each of which controls a property of the covariance.
amplitude ($\sigma$) controls the scaling of the output along the y-axis. This parameter is just a scalar multiplier and is therefore usually left out of implementations of the Matérn function (i.e. set to one)
lengthscale ($l$) complements the amplitude by scaling realizations on the x-axis. Larger values push points closer together along this axis.
roughness ($\nu$) controls the sharpness of ridges in the covariance function, which ultimately affect the roughness (smoothness) of realizations.
Though in general all the parameters are non-negative real-valued, when $\nu = p + 1/2$ for integer-valued $p$, the function can be expressed partly as a polynomial function of order $p$ and generates realizations that are $p$-times differentiable, so values $\nu \in {3/2, 5/2}$ are most common.
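For example, with $\nu = 3/2$ the Matérn covariance (unit amplitude) reduces to the simple closed form $k(d) = (1 + \sqrt{3}\,d/l)\exp(-\sqrt{3}\,d/l)$, which can be sketched as:

```python
import numpy as np

def matern_32(d, lengthscale=1.0):
    # Matern covariance with nu = 3/2 and unit amplitude:
    # k(d) = (1 + sqrt(3)*d/l) * exp(-sqrt(3)*d/l)
    r = np.sqrt(3.0) * np.abs(d) / lengthscale
    return (1.0 + r) * np.exp(-r)

k0 = matern_32(0.0)                         # identical inputs -> covariance 1
k_vals = matern_32(np.linspace(0, 5, 50))   # decays monotonically with distance
```

Realizations drawn under this kernel are once differentiable ($p = 1$), in line with the $p$-times-differentiability property stated above.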
A GP kernel can be specified as the sum of additive components in scikit-learn simply by using the sum operator, so we can include a Matérn component (Matern), an amplitude factor (ConstantKernel), as well as an observation noise (WhiteKernel):
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (15,10)
N = 100
a = tf.placeholder("float")
b = tf.placeholder("float")
c = tf.placeholder("float")
d = tf.placeholder("float")
x = tf.linspace(-5.0, 10.0, num = N, name = "x")
noise = tf.random_normal(shape = [N,], mean = 0, stddev = 50.0, seed = 1)
# y = a*x^3 + b*x^2 + c*x +d
with tf.Session() as sess:
    y = sess.run(tf.add(tf.add(tf.add(tf.add(tf.multiply(a,tf.pow(x,3)),tf.multiply(b,tf.pow(x, 2))),tf.multiply(c,x)),d),
                        noise), feed_dict= {a: -1, b: 10, c: 1, d: 1})
    x = np.reshape(np.array(x.eval()), (N,1))
y = np.reshape(np.array(y) / np.max(y), (N,1))
fig = plt.figure(figsize=(12, 8))
plt.scatter(x, y)
plt.xlabel("x",fontsize = 40)
plt.ylabel("y",fontsize = 40, rotation = 0)
plt.show()
kernel = Matern(length_scale = 1, nu = 3/2)
#kernel = ConstantKernel(1.0, (1e-3, 1e3)) * RBF(1, (1e-2, 1e2))
#kernel = ConstantKernel() + Matern(length_scale=1, nu=3/2) + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 9, alpha=0.1)
#gp = gaussian_process.GaussianProcessRegressor()
gp.fit(x, y)
gp.kernel
Along with the fit method, each supervised learning class retains a predict method that generates predicted outcomes ($y^{\ast}$) given a new set of predictors ($X^{\ast}$) distinct from those used to fit the model. For a Gaussian process, this is fulfilled by the posterior predictive distribution, which is the Gaussian process with the mean and covariance functions updated to their posterior forms after fitting.
$$ p(y^{\ast}|y, x, x^{\ast}) = \mathcal{GP}(m^{\ast}(x^{\ast}), k^{\ast}(x^{\ast})) $$where the posterior mean and covariance functions are calculated as:
$$ m^{\ast}(x^{\ast}) = k(x^{\ast},x)^T[k(x,x) + \sigma^2I]^{-1}y $$$$ k^{\ast}(x^{\ast}) = k(x^{\ast},x^{\ast})+\sigma^2 - k(x^{\ast},x)^T[k(x,x) + \sigma^2I]^{-1}k(x^{\ast},x) $$
x_pred = np.linspace(-5, 10, num = 100).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)
print(y_pred[0:5])
print(sigma[0:5])
plt.scatter(x, y)
plt.plot(x_pred, y_pred, c='r')
plt.plot(x_pred, y_pred + 2*sigma, c='g')
plt.plot(x_pred, y_pred - 2*sigma, c='g')
plt.xlabel("x",fontsize = 20)
plt.ylabel("y",fontsize = 20, rotation = 0)
plt.show()
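The posterior mean and covariance formulas above can also be implemented directly in NumPy, as a minimal sketch of what GaussianProcessRegressor.predict does internally (the squared exponential kernel and the toy data here are assumptions chosen for illustration, not the Matérn fit from above):

```python
import numpy as np

def se_kernel(a, b, l=1.0):
    # Squared exponential covariance between two sets of 1D inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l ** 2)

# Toy training data and a prediction grid
x_tr = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y_tr = np.sin(x_tr)
x_te = np.linspace(-5, 4, 50)
noise = 0.1  # sigma^2, observation noise variance

K = se_kernel(x_tr, x_tr) + noise * np.eye(len(x_tr))
K_star = se_kernel(x_te, x_tr)

# Posterior mean: k(x*, x)^T [k(x,x) + sigma^2 I]^{-1} y
mean_star = K_star @ np.linalg.solve(K, y_tr)

# Posterior covariance: k(x*, x*) - k(x*, x)^T [k(x,x) + sigma^2 I]^{-1} k(x, x*)
cov_star = se_kernel(x_te, x_te) - K_star @ np.linalg.solve(K, K_star.T)
std_star = np.sqrt(np.clip(np.diag(cov_star), 0, None))
```

Plotting mean_star with a band of mean_star ± 2*std_star reproduces the kind of mean-plus-uncertainty picture shown above; the posterior standard deviation shrinks near training points and grows toward the prior away from them.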